Ordering Requests: Part I

Requirements for order

Replica coordination requires, alongside the agreement property, the order property, which states that every non-faulty state machine replica processes requests in the same relative order. One way to implement order is to assign a unique identifier to each request and have replicas process requests according to the total order on those unique identifiers.

With replicas processing requests according to the order of their unique identifiers, we define a request as stable at a state machine replica sm_i if a non-faulty client cannot subsequently deliver a request with a lower unique identifier to sm_i. We can enforce the order requirement if every replica processes only stable requests, smallest unique identifier first. This is called order implementation. The order implementation requires request handling that lets replicas maintain a list (or any suitable data structure) of unprocessed stable requests and remove requests from it once processed.

Order implementation requires a method to assign unique identifiers to requests. This method also needs to be constrained by the assumptions that a client can make about the order in which state machines process requests:

  1. [O1] A state machine processes requests issued by a single client in the order the client issued them. For example, if a client c issues requests to a state machine sm in the order request A, then request B, and then request C, then sm will process A first, then B, and then C.

  2. [O2] If a request r_0 made by a client c_0 to state machine sm causes another client c_1 to make a request r_1 to sm, then sm will process r_0 before r_1.

These two assumptions do not necessarily mean that sm will process requests in the order they were made or received.

Let's discuss three ways to implement order. In each case, replicas need a test to classify a request as stable, so every order implementation below includes a stability test section explaining how replicas check whether a request is stable.

Note: You can review how unique identifiers can be generated in the Sequencer chapter.

Order implementation

Method 1: Logical clocks

A logical clock is a function that maps events to integers. A logical clock \hat{T} maps an event e to a number \hat{T}(e). It maps events such that for any two distinct events e and e', \hat{T}(e) and \hat{T}(e') cannot have the same value: either \hat{T}(e) < \hat{T}(e') or \hat{T}(e) > \hat{T}(e'). Furthermore, if e causes e', then \hat{T}(e) < \hat{T}(e').

Implementation

We can implement logical clocks in a distributed system by associating a counter \hat{T}_p with each process p. When p sends a message, it includes a timestamp: the value of \hat{T}_p at the moment of sending. The value of \hat{T}_p is updated as follows:

  1. Process p increments \hat{T}_p after each event at p.

  2. When p receives a message with a timestamp \tau, p resets \hat{T}_p based on the following equation:

\hat{T}_p = \max(\hat{T}_p, \tau) + 1
Example of a logical clock

The figure above shows how our implementation works with three processes p, q, and r. The dots represent events. Arrows connect dots such that the event at the tail might be responsible for causing the event at the head. For example, when an arrow starts at an event on one process and ends at an event on a different process, the starting event is the sending of a message and the ending event is the receipt of that message. The value of \hat{T}_p(e) for each event appears above its dot.

For an event e that occurs at processor (node) p, let \hat{T}(e) be the value of \hat{T}_p(e) when e occurs, with a fixed-length bitstring that uniquely identifies p appended to it. \hat{T}(e) will uniquely identify the event e such that O1 and O2 are satisfied.
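As a sketch of how these rules yield totally ordered identifiers, here is a minimal Lamport-style logical clock; the class and method names are illustrative assumptions, not an API from the source, and the node id stands in for the fixed-length bitstring:

```python
from dataclasses import dataclass

@dataclass
class LogicalClock:
    """A Lamport-style logical clock for one process (illustrative sketch)."""
    node_id: int   # stands in for the fixed-length bitstring identifying p
    counter: int = 0   # the counter \hat{T}_p

    def tick(self) -> int:
        """Rule 1: increment the counter after each local event."""
        self.counter += 1
        return self.counter

    def receive(self, timestamp: int) -> int:
        """Rule 2: on receiving a message with timestamp tau,
        reset the counter to max(counter, tau) + 1."""
        self.counter = max(self.counter, timestamp) + 1
        return self.counter

    def unique_id(self) -> tuple[int, int]:
        """Identifier for the latest event: the counter value with the
        node id as a tie-breaker. Tuple comparison yields a total order."""
        return (self.counter, self.node_id)

p, q = LogicalClock(node_id=1), LogicalClock(node_id=2)
send_ts = p.tick()        # p's send event; p's counter becomes 1
q.receive(send_ts)        # q's receive event; q's counter becomes 2
assert q.unique_id() > p.unique_id()   # the receive is ordered after the send
```

Using a (counter, node_id) tuple makes distinct events comparable even when two processes happen to share a counter value, which is exactly what the appended bitstring achieves.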

Stability test

Now that we understand the use of logical clocks for order implementation, let's devise a stability test that allows replicas to check whether a request is stable.

Note: The method of using logical clocks for order implementation will not work in an asynchronous environment (where network messages can take arbitrarily long to be delivered, nodes can take arbitrarily long to process them, and it is therefore impossible to detect whether a node has failed). Additionally, the methods of this subsection do not apply to Byzantine failures. This section only applies to fail-stop failures where messages take a bounded time to be delivered on the network.

We will require a mechanism to determine whether a node or message has been delayed long enough to be classified as faulty. We will discuss failure detection for fail-stop failures and make the following assumptions:

  1. FIFO channels assumption: Messages between nodes are delivered in the order the sender sent them. This is equivalent to saying that nodes have FIFO channels between them.

  2. Failure detection assumption: A node p can only detect that another node q has failed after p has received the last message sent by q to p.

The second assumption is consistent with the first: a fail-stop failure occurs at a node after it has sent its last message, and with FIFO channels, the recipient receives that last message after all other messages from that node.

Based on the above assumptions, we can use the following stability test:

  1. Every client periodically makes a null request to the state machine.

  2. A request is stable at replica sm_i if sm_i has received a request with a larger timestamp from every non-faulty client.
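The replica's side of this test can be sketched as follows, with identifiers modeled as (timestamp, client_id) pairs compared lexicographically; the function and variable names are illustrative, not from the source:

```python
# A replica tracks, per client, the largest identifier seen so far (null
# requests update it too). A pending request is stable once its identifier
# is below the latest identifier received from every non-faulty client.

def stable_requests(pending, last_id_from, non_faulty_clients):
    """Return the pending requests that are stable. If some non-faulty
    client has not been heard from at all, nothing is stable yet."""
    if not all(c in last_id_from for c in non_faulty_clients):
        return []
    floor = min(last_id_from[c] for c in non_faulty_clients)
    return sorted(r for r in pending if r < floor)

pending = [(1, 0), (1, 1), (7, 3)]           # e.g. 001c0, 001c1, 007c3
last_id = {0: (2, 0), 1: (1, 1), 3: (7, 3)}  # latest identifier per client
# The smallest "latest" identifier is (1, 1), so only (1, 0) lies below it.
assert stable_requests(pending, last_id, {0, 1, 3}) == [(1, 0)]
```

This mirrors the slideshow below: no request is stable until every client has been heard from, and null requests advance the per-client floor without otherwise changing state.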

Slide 1 of 8: Five clients send requests to a state machine. The state machine orders the requests according to their unique identifiers; the order of arrival in the request feed box is right to left. Unique identifiers generated by requests appear in the pink boxes of the respective clients. So far, only one client has sent any requests.

Slide 2 of 8: The requests sent by c1 are considered non-stable by the state machine since it does not yet have request identifiers from all clients. The state machine cannot declare any request stable because the smallest timestamp on the logical clocks of all clients is unknown. Three clients send null requests (highlighted in pink).

Slide 3 of 8: The null requests only update the last request identifiers of the respective clients; they have no other effect. The state machine still cannot deem any non-stable request stable since not all clients have made a request.

Slide 4 of 8: The state machine adds requests 007c3 and 002c4 to the list of non-stable requests and updates the last request identifiers for the corresponding clients. Now that requests have arrived from all clients, the state machine knows that the smallest identifier among requests is 001c0. Currently, there is no request for which the state machine has received a larger timestamp from every client, so no request can be deemed stable. More null requests arrive.

Slide 5 of 8: These null requests update the last request identifiers for c3 and c4, but there is still no request for which the state machine has received a larger timestamp from every client. Therefore, no request can be deemed stable at this point. Let's see if that changes with the incoming null request 002c0.

Slide 6 of 8: The smallest of the latest identifiers is now 002c0. This makes 001c1 stable, and the state machine moves that request to the list of stable requests.

Slide 7 of 8: Clients assign higher identifiers to their requests over time.

Slide 8 of 8: Requests keep becoming stable. Here, the lowest unique identifier received is 003, so all requests with lower unique identifiers will become stable if they have not already.

Any request that sm_i receives from a client c it deems non-faulty will have a larger unique identifier than any of c's previous requests. Due to the FIFO channels assumption, sm_i cannot receive a request with a smaller unique identifier from a non-faulty client. Therefore, all unique identifiers of requests that sm_i receives after the current stable request will be larger than the stable request's unique identifier. The failure detection assumption ensures that sm_i cannot receive a request from a failed client after detecting the failure, since the failure can only be detected after sm_i has received the client's last message. Therefore, sm_i will not receive a request with a unique identifier lower than that of a stable request.

Method 2: Synchronized real-time clocks

Another way to generate unique identifiers is by using approximately synchronized real-time clocks. If T_p(e) is the time (the value on the real-time clock) at which event e occurs at node p, we can create a unique identifier for e by using this T_p(e) together with a unique fixed-length string identifying p, similar to what we did with logical clocks.
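Under this scheme an identifier can be sketched as the clock reading paired with the node's fixed identifier; the function and names below are illustrative assumptions:

```python
import time

def unique_id(node_id: int, now=time.time) -> tuple[float, int]:
    """Identifier for an event at this node: the real-time clock reading
    with the node id appended as a tie-breaker (illustrative sketch)."""
    return (now(), node_id)

# With a fake clock, two events stamped at the same reading are still
# totally ordered by the node id component.
a = unique_id(node_id=1, now=lambda: 100.0)
b = unique_id(node_id=2, now=lambda: 100.0)
assert a < b
```

Injecting the clock as a parameter keeps the sketch testable; in a real deployment `time.time` would be backed by the synchronized clock.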

We'll impose the following restrictions to enforce the assumptions (O1 and O2) that a client can make about the order in which a state machine processes requests:

  1. For O1, we assume a client cannot make two or more requests between successive clock ticks. For processor clocks with a resolution of R seconds, every client can make at most one request every R seconds. If a client c made two requests between successive clock ticks, a state machine could not tell which request c made first.

  2. For O2, we assume clocks are synchronized to a degree better than the minimum message delivery time. If the difference between the clocks on different nodes is within \delta seconds, then it must take more than \delta seconds for a message to travel from one node to another. Without this restriction, a client c_1 could make a request r_1 with a smaller unique identifier than the request r_0 that caused c_1 to make r_1.

Stability test

We will formulate a stability test using the bounds on message delivery time. We assume that \Delta is a constant such that every non-faulty replica receives a request r no later than the time given by the request's unique identifier plus \Delta, as measured on the receiving replica's clock.

Once the clock on a node, say node p, reads a time \tau, p cannot receive a request whose unique identifier is less than \tau - \Delta.

A request r will be stable at the state machine replica sm_i if the clock of sm_i reads a time \tau such that the unique identifier of r is less than \tau - \Delta.

The problem with this stability test is that it forces a state machine to lag behind its clients by \Delta, the worst-case message delivery time. We can avoid this by implementing the FIFO channel requirement mentioned earlier. With FIFO channels, a state machine replica can only receive requests with unique identifiers greater than those of previous requests from the same client. With this, we can use another stability test.

A request r will be stable at sm_i if a request with a unique identifier larger than r's has been received from every client.

The second test fails if a node refuses to make requests. We can combine the two tests and consider a request stable if it passes either one. In that case, the lag of \Delta only occurs when network delays or faulty nodes force it to.
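The combined test can be sketched as follows; the value of Δ, the function name, and the client names are all chosen purely for illustration:

```python
DELTA = 5.0   # assumed worst-case message delivery time, in seconds

def is_stable(req_ts, replica_clock, last_ts_from, clients, delta=DELTA):
    """A request is stable if it passes EITHER test:
    1. bounded-delay test: its timestamp is older than replica_clock - delta;
    2. FIFO test: every client has sent a request with a larger timestamp."""
    delay_test = req_ts < replica_clock - delta
    fifo_test = all(
        last_ts_from.get(c, float("-inf")) > req_ts for c in clients
    )
    return delay_test or fifo_test

# Request stamped at t=10, replica clock at 12 (still within Delta), but all
# clients have since sent newer requests, so the FIFO test declares it stable.
assert is_stable(10.0, 12.0, {"c0": 11.0, "c1": 13.0}, {"c0", "c1"})
# A silent client blocks the FIFO test; stability now waits for the replica's
# clock to pass 10 + Delta.
assert not is_stable(10.0, 12.0, {"c0": 11.0}, {"c0", "c1"})
assert is_stable(10.0, 15.5, {"c0": 11.0}, {"c0", "c1"})
```

The disjunction captures the point of combining the tests: the Δ lag is paid only when a slow or silent node prevents the cheaper FIFO test from passing.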

Note: Placing an upper bound on network message delivery time is not an easy task. Google Spanner's TrueTime API achieved that feat by ensuring that clock readings remain within a known bound.

Point to ponder

Question

In practical terms, what is one of the challenges of implementing synchronized real-time clocks?


Synchronizing real-time clocks within reasonably tight bounds is challenging. Although Spanner’s TrueTime has achieved it, it requires maintaining multiple atomic and GPS clocks. Such machinery might not be available to everyone (though this might change in the future due to services like clocksync that provide TrueTime-like facilities as a service).

Additionally, there can be instances when the time bound increases momentarily, even in a TrueTime setup. During those scenarios, different components of the system artificially slow down to preserve consistency guarantees. With a potentially variable \Delta, the stability test can fail.

What’s next?

So far, we have discussed protocols for generating unique identifiers for requests where the client proposes the unique identifier. In the next lesson, we will discuss a protocol where state machine replicas propose and agree on a unique identifier for a request.
